Exact Solution of the Munoz-Eaton Model for Protein Folding 
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A transfer-matrix formalism is introduced to evaluate exactly the partition function of the Munoz- 
Eaton model, relating the folding kinetics of proteins of known structure to their thermodynamics 
and topology. This technique can be used for a generic protein, for any choice of the energy and 
entropy parameters, and in principle allows the model to be used as a first tool to characterize the 
dynamics of a protein of known native state and equilibrium population. Applications to a /3-hairpin 
and to protein CI-2, with comparisons to previous results, are also shown. 
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Recent experimental findings on the folding of small 
proteins suggest that, despite the complex microscopical 
dynamics of the proteins in solution, the overall char- 
acteristics of the folding kinetics (e.g, the equilibrium 
rates) could be quite simple, and principally related to 
the topology of the native state. This has led to the con- 
struction of some simple models [0, ||, |[ ||, ||] that, basing 
on the knowledge of the native fold, aim to predict the 
kinetic properties and their changes upon mutations. 

Even if, usually, the interest in the folding is related 
to prediction of the native structure, the understanding 
of the folding process of proteins of known structure is 
nonetheless of great importance, both for theoretical rea- 
sons, and because important diseases are due to the ag- 
gregation of partially folded or misfolded proteins . 

The importance of these models is that they are simple 
enough to be dealt with quite easily, and have proved to 
provide relevant information on the folding process as a 
whole at physiological temperatures. They are useful to 
test the hypothesis that a reaction coordinate exists, with 
one or a few major rate-limiting steps, and that motion 
along it is ruled by the formation of just the native con- 
tacts, so that all the others can be effectively averaged 
out. So, in a sense, they are complementary of the more 
realistic all-atoms models, that, due to their computa- 
tional cost, can be applied to characterize just a small 
part of the folding process at room temperature, or to 
follow the unfolding at unrealistically high temperatures 
from a biological point of view. 

In this letter we focus on the Muhoz-Eaton (ME) 
model |Q, M , where a protein is described by a chain of 
peptide bonds that can live in just two states: native and 
unfolded. Two residues can interact only if they do so in 
the native structure and if all the intervening residues in 
the chain between them are native. 

In the application to several proteins S, the authors 
do not consider a detailed dynamics, but rather relate 
the relaxation rates to the free energy profiles Fj (i.e. 
the free energy as a function of the number of native 
bonds j). Moreover, to reduce the huge number of con- 
figurations that any protein can live in, even after the 
discretization of the states, they calculate these profiles 
resorting to the "single/double/triple sequence" approx- 
imation (SSA/DSA/TSA): they consider only the con- 



tribution to the free energy of configurations where just 
one/two/three stretches of native bonds are present, thus 
effectively lowering the entropy of the unfolded state. Ex- 
act solutions of the ME model under the assumption 
of homogeneous (i.e. residue independent) interactions 
have been carried out in [^j for the /3-hairpin and the a- 
helix structures, together with a mean field approxima- 
tion whose accuracy, in comparison with a Monte Carlo 
simulation, was assessed for protein CI-2. 

The main result in this letter is the presentation of a 
way to drop these approximations and calculate the exact 
free energy, correlation functions and any other relevant 
thermodynamical quantity for any given protein, within 
the model assumptions. The existence of an exact so- 
lution is interesting for at least two reasons: the first, is 
that, in general, exact solutions are rare, and provide use- 
ful benchmarks to test approximations and simulations. 
The second is that, in this case, the solution is also easy 
to implement, and allows the evaluation of a protein free 
energy in a few seconds of CPU time. As an illustration, 
we apply our technique to the C-terminal /3-hairpin of 
streptococcal protein G Bl and to protein CI2, finding 
relevant differences from approximate results. 

In the ME model |], Q g] the state of a protein of 
N + 1 residues can be described by a binary variable 
rrii for each peptide bond i = l,...,N (peptide bond 
i connects residues i and i + 1). The values = 0, 1 
indicate that the bond is unfolded or native, respectively. 
The hamiltonian (indeed a free energy function) reads 
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The first term assigns an energy ey < to the contact be- 
tween bonds i, j (also accounting for the contact between 
residues i, j + 1). This energy gain is present provided 
that all the bonds from i to j are native, and that the 
contact exists also in the native structure (Ay- = 1 in 
that case; Ay = otherwise). The second term repre- 
sents the entropic cost As; < of ordering bond i in the 
native state. From the above equation, it follows that 
the contribution w({rrik}) = exp[— H({rrik})/(RT)] of 
any configuration {m^} = (mi, 7X1%, . . . , mjv) to the par- 
tition function Z = X){ mfc } w ({ m k}) is just the product 
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FIG. 1: 2D description of a configuration (i.e {mi} = 
{rrii t i} — (0, 1, 1, f , f , 0, 0, f , I, I)) of the original model, for 
N — 10. Filled symbols represent l's; empty symbols 0's. 
Notice that, in any configuration, the l's will group in trian- 
gular regions close to the diagonal, due to Eq. (I). Hence, in 
each row j, all the l's will stay on the right, yielding just j + 1 
possible states for the row. In evaluating the transfer matrix, 
the square angle vertex of the triangle, (j, i) (here (5,2) and 
(10,8)), will be attributed the weight W^%. 



of the weights of the stretches of native bonds contained 
in that configuration; namely: 



= exp 
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from row j + 1 to row j will be defined by its action on 
the vectors v J k +1 by the equation: 
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v k-i ' for fc = 1, 
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where A is a dummy variable whose exponent takes into 
account the number of l's (and hence, of native residues) 
being introduced at row j. This will serve to generate the 
free energy profiles at fixed number of native residues, 
as explained below. In components, observing that, for 
each j, vl (k = 0, . . . , j) can be considered the basis of a 
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(j + l)-dimensional space, we write (v k )i 



(Qi|_l(A)) = (1 — fym-l + *W A* 1 Wjj +2 -i, 

(5) 

where I = 1, . . . , j + 1 and m = 1, . . . , j + 2. Thus, Qj+x 
are rectangular matrices; the biggest one, Qjv+ij i s an 
(N + 1) x (N + 2) matrix. 

The partition function of the model will be calculated 
as Z = Z(X = 1), where 

N 

Z{\) = ^ < +1 = Z3 * ■ (6) 
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for a native stretch going from bond i to bond j (i < j); 
for later use we define wjj+i = 1. 

We start by mapping the original one-dimensional (ID) 
model, with long range interactions, to a 2D one, with 
nearest-neighbour interactions. We do so by introducing 
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whence the obvious constraints 
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= 0, 1 and rriji — 



rrij-i^i 77ij,i+i follow. The 2D representation of a generic 
configuration of the original model is reported in Fig. [l]. 
It can be shown that, in the new variables, it is possi- 
ble to apply the Cluster Variation Method |)| to write 
a variational free energy function, whose minimum pro- 
vides the exact free energy of the model ]l0| . Here we 
follow an alternative route, and notice that the states of 
row j below the diagonal in the 2D model are completely 
defined specifying the number of l's in the row. Notice 
in fact that all the l's stay necessarily at the right end of 
the row (Fig. |l|): this is the key observation, allowing the 
introduction of an efficient transfer matrix formalism to 
solve the model. In fact, while the diagonal (i.e. the set 
of original variables of the model) can assume 2^ config- 
urations, row j in the lattice (but the same is also true 
for the columns) can assume only j + 1 states, according 
to how many l's are present. Let v J k represent the state 
with k l's (k = 0, . . . , j) of row j. The transfer matrix 



Here and below H}(A) = Q\ +1 {\) ... Q J ^ 1 (X). From 
the above expressions it is clear that Z(\) is the gener- 
ating function for the contributions Zj coming from the 
configurations with fixed number of native bonds j: from 
here we recover the free energy as a function of the native 
bonds as 



F 3 = - RJ 'log Zj 



(7) 



In the same fashion it is possible to evaluate the aver- 
age values of rrijj, Eq. (|^): introducing the matrix Pij'- 
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for k < j - 
otherwise, 



we will have 
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Since 



= 1 represents a native stretch going from i 



to j, (rrijj) is the probability of observing such a stretch. 
Other important quantities in defining the folding path- 
ways are the = ((1 - n^-iXIliLi Tn fc)( 1 _ m j+i))> 
which represent the probability of a native stretch going 
from bonds i to j, preceded and followed by a non- native 
bond. They can be calculated as 



Qj+i Oj+ij+i 



n J+1 v 

11 JV+l v < 



(10) 



3 



1 3 5 7 9 11 13 15 




2 4 6 8 10 12 14 
number of native bonds 

FIG. 2: Free energy profiles for the hairpin according to 
Eq. (Q): x-axis: number of native bonds; y-axis: contribu- 
tion to free energy (in Kcal/mol) .Exact solution (solid line); 
modified SSA (dotted line; see || for details). Here T = 297 
K, AH hb = -1.09 Kcal/mol, AG SC = -2.03 Kcal/mol, 
AS C onf = —3.12 cal/ (K mol), see text. Notice that the native 
state minimum corresponds to just eleven native bonds: the 
native state is such that the hydrophobic cluster is formed, 
but the terminal residues are disordered. 




FIG. 3: Density plot of — log/ij^ (see Eq. (|10|)) for the hair- 
pin: darker regions correspond to more probable stretches. 
It is possible to recognize the unfolding pathway as (13, 3) — 
(12, 3) -(11, 3) -(11, 4) -(11, 5) -((10, 5) or (11, 6)) -(10, 6)- 
((9, 6) or (10, 7)) — (9, 7), where the position (j, i) in the matrix 
corresponds to a native stretch from bond i to j > i. 
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Q 3 , (A = 1); the matrix Eij propagates 



only the state with a stretch of j — i + 1 native bonds 
between i and j: 



E hj v k = § k,j-i+l V 3 



while Ojj forces bond j to be non native: 

°3,3 v l = § k,0 V 3 . 



(11) 
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We have applied our technique to the 16 C-terminal 
residues from protein G Bl, folding in a /3-hairpin [Q, 
In order to compare our results with those in H, we set 
the contact matrix Ay in Eq. (|l|) according to Fig. 1 
in and specialize As.; and to the values As; = 
A^con/ and = pAH^b + <7AG SC , where p is the num- 
ber of backbone-backbone hydrogen bonds between pep- 
tide bonds i and j, and q is the number of side-chain 
hydrophobic interactions between residue i and j + 1; 
AHhb, AG SC are the (free) energy contributions of hydro- 
gen bonds and "hydrophobic" interactions, respectively, 
while AS con f is the entropy loss upon fixing a bond in its 
native conformation (refer to the original article for a de- 
tailed discussion). We fix AHhb, AG SC , AS con f by fitting 
("113,3), the population of the stretch with formed hy- 
drophobic cluster, to the experimental data in H, Fig. 3. 
We find that the exact solution Eq. (0) yields free energy 
profiles which are smoother than those obtained with 
SSA, with considerably lower barriers (see Fig. ||). As 
expected, the differences are greater in the unfolded part 
of the profile, where the entropic contribution of partially 
ordered conformations dominates. Even if we do not at- 
tempt a microscopic dynamics, upon assuming that a 
single-residue dynamics in the equilibrium landscape may 
represent the true kinetics we can try to characterize the 
(equilibrium) folding barriers. To this end, we consider 
the values of the fijj of Eq. (]l0|). They allow to follow 
the kinetic pathway of the hairpin, since the length of the 
native stretches must increase on approaching the native 



state, and it can be verified that, for the hairpin, it is 
most likely that a n-bonds native stretch is created by 
adding a native site at one end of (n — l)-stretch, rather 
than filling a 1-bond gap between two shorter stretches 
and merging them. Fig. ^ suggests that the barrier in 
the folding/unfolding pathway corresponds to the forma- 
tion of six-bond- long stretches, either from peptide bond 
5 (Y45-D46) to 10 (K50-T51) or from 6 (D46-D47) to 
11 (T51-F52). The presence of a folding barrier thus ap- 
pears as a feature of the model, and not just of SSA, and 



contrasts with the results in 11 



Coming to proteins, we have applied our exact solution 
to the 65 terminal residues of protein CI-2 (2CI2), choos- 
ing the values of e,j Ay in Eq. (|l|) according to the fol- 
lowing procedure. As in B, we define an atomic contact 
to be present if two nonhydrogen atoms, from residues i 
and j > i + 2, are closer than 0.4 nm and we give an en- 
ergy ke to residues contacts involving 5(fc— 1) < n at < 5fc 
atomic contacts. Moreover, we get the native secondary 
structure from ||l2f and assign to the peptide bond pre- 
ceding a residue marked by B, E, G, H, I, T, the entropy 
cost Asi of the "structured" elements, while the other 
symbols will get the entropy Asq of the "less structured" 
parts (i.e. coils, loops). Then, to fix e, Asi, Aso we 
assume that the protein has a two-state thermodynam- 
ics: the equilibrium population of the native state will be 
given by 9 — Z nat /Z, where Z nat is the sum of the con- 
tributions from the right of the barrier in Fig. § Then, 
we fit the calorimetric data for AG = G u — G n in [ p^[ , 
by asking that (1 -6)/6 = exp(-AG/ET). 

We can then study the differences between the 
SSA/DSA and the exact free energy profiles, as well as 
the equilibrium values for the fij i of Eq. (10). The results 
are reported in Figs. ^ and || respectively, at the temper- 
ature of equal native and unfolded population T = 343.54 
K |l3] |. It is clear that SSA and DSA deeply underes- 
timate the configurational entropy of the unfolded state, 
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FIG. 4: Free energy profiles for protein CF2 according to 
Eq. (Q): x-axis: number of native bonds; y-axis: contribution 
to free energy (in Kcal/mol). Exact solution (solid line); SSA 
(dotted); DSA (dashed); see ffl for details. Here e = -0.550 
Kcal/mol, As = -1.327 cal/(K mol), Asi = -3.863 cal/(K 
mol), from the fit of calorimetric data: see text. 
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FIG. 5: Contour plot of -log jij.i (Eq. @) for CI-2: darker 
regions represent more likely stretches. Parameters as before. 



yielding different barriers to folding and unfolding. The 
exact free energy profile is also much smoother than the 
approximate ones. Fig. || suggests that, in the unfolded 
state, the helix (13-24) and the region including a turn 
(51-56) are the most populated. Experimentally, the un- 
folded state is devoid of fixed structure |Q , but the N- 
tcrminal part of the helix, together with the residues L50, 
158 Jig ], constitute the folding nucleus: the model gives 
interesting clues about this, even if with a turn overesti- 
mation |l6| . The helix was already noticed to be partic- 
ularly populated in |j| [TtJ . 

From Fig. || we can also speculate about the fold- 
ing pathway: assuming an equilibrium folding dynamics, 
with the number of native bonds increasing of at most 
one bond per time step, we try to characterize the un- 
folding pathway by asking whether a /ijj will more likely 
drop an end, producing fJ>j-i t i or /z^j+i, or will split in 
two stretches (i, k — 1) and (k + l,j) for some k. Upon 
reverting the unfolding pathway, the above analysis sug- 
gests that the folding starts with the formation of the 
turn at 54-55; it grows up to the last residue and then 
towards the N-terminus; that folds last. Although this 
result correctly suggests that the helix alone cannot nu- 
cleate the folding [0, 14 , this pathway does not agree 



with the experimental one: this could be partially due 
to our one-residue "dynamics"; notice though that the 
model cannot account for interactions as those in the 
folding nucleus, involving residues that are not in the 
same native stretch: this is probably its major limit. 

In conclusion, we have presented the complete exact 
solution for the thermodynamics of the ME model, in a 
form which is also easy to implement and to apply to real 
proteins. We have analyzed the cases of a /3-hairpin pep- 
tide and of protein CI-2, finding in both cases relevant 
differences from previously published results. Our solu- 
tion opens a wide range of applications to real proteins, 
especially after a better characterization of the relation- 
ship between equilibrium properties and folding dynam- 
ics is found: work is in progress along these lines. 

We are grateful to A. Maritan, F. Cecconi and A. 
Flammini for drawing our attention to the ME model, 
for communicating their results prior to publication and 
for fruitful discussions. We are also indebted to M. Ven- 
druscolo for kindly sharing with us his computer program 
to calculate contact maps. 
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